Correlation Clustering Revisited: The "True" Cost of Error Minimization Problems
Abstract
Correlation Clustering was defined by Bansal, Blum, and Chawla as the problem of clustering a set of elements based on a possibly inconsistent binary similarity function between element pairs. Their setting is agnostic in the sense that a ground truth clustering is not assumed to exist, and the only reasonable way to measure the cost of a solution is by comparing it with the input similarity function. This problem has been studied both in theory and in applications, and was subsequently proven to be APX-Hard. In this work we assume that there does exist an unknown correct clustering of the data. This is the case in applications such as record linkage in databases. In this setting, we argue that it is more reasonable to measure the accuracy of the output clustering against the unknown underlying true clustering. This corresponds to the intuition that in real life an action is penalized or rewarded based on reality and not on our noisy perception thereof. The traditional combinatorial optimization version of the problem only offers an indirect solution to our revisited version, via a triangle inequality argument applied to the distances between the output clustering, the input similarity function, and the underlying ground truth. In the revisited version, we show that it is possible to shortcut the traditional optimization detour and obtain a factor-2 approximation. This factor could not possibly have been obtained by using a solution to the traditional problem as a black box, unless it was an exact optimal solution. Our result therefore shortcuts the APX-Hardness, and could be useful for revisiting many other combinatorial optimization problems. Our analysis consists of two solutions. The first gives a simple 2-approximation algorithm. The second involves a novel way to continuously morph a general (non-metric) distance function into a metric. This technique is interesting in its own right and may be useful for other metric embedding problems. The resulting morphed solution is randomly rounded into a clustering. En route, in certain cases we obtain a certificate for the possibility of getting a solution of factor strictly less than 2. Finally, we show simple cases in which randomness is necessary for achieving a solution of factor strictly less than 2, thus justifying the use of randomization in our solution.
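The triangle inequality argument mentioned in the abstract can be written out as a short derivation. This is only a sketch of one reading of that argument, and the notation is assumed here rather than taken from the paper: $h$ is the input similarity function, $\tau$ the unknown true clustering, $C$ the output clustering, $d(\cdot,\cdot)$ the number of element pairs on which two pair labelings disagree, and $\alpha \ge 1$ the approximation factor of a black-box algorithm for the traditional problem.

```latex
% Assumed notation (not from the source): h = input similarity function,
% \tau = ground-truth clustering, C = output clustering,
% d = number of disagreeing pairs (a Hamming distance, hence a metric),
% \alpha = approximation factor of the black-box traditional solver.
\begin{align*}
  d(C, h) &\le \alpha \cdot \min_{C'} d(C', h)
           \le \alpha \cdot d(\tau, h)
           && \text{($\tau$ is itself a feasible clustering)} \\
  d(C, \tau) &\le d(C, h) + d(h, \tau)
           && \text{(triangle inequality)} \\
           &\le (\alpha + 1)\, d(h, \tau).
\end{align*}
```

Under these assumptions the black-box route yields a factor of $\alpha + 1$ relative to $d(h, \tau)$, which equals 2 only when $\alpha = 1$, that is, when the traditional problem is solved exactly. This is consistent with the abstract's remark that a factor of 2 cannot be reached through a black-box solution unless that solution is optimal.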